HyperHMM: efficient inference of evolutionary and progressive dynamics on hypercubic transition graphs

Abstract Motivation The evolution of bacterial drug resistance and other features in biology, the progression of cancer and other diseases and a wide range of broader questions can often be viewed as the sequential stochastic acquisition of binary traits (e.g. genetic changes, symptoms or characters). Using potentially noisy or incomplete data to learn the sequences by which such traits are acquired is a problem of general interest. The problem is complicated for large numbers of traits, which may, individually or synergistically, influence the probability of further acquisitions both positively and negatively. Hypercubic inference approaches, based on hidden Markov models on a hypercubic transition network, address these complications, but previous Bayesian instances can consume substantial time for converged results, limiting their practical use. Results Here, we introduce HyperHMM, an adapted Baum–Welch (expectation–maximization) algorithm for hypercubic inference with resampling to quantify uncertainty, and show that it allows orders-of-magnitude faster inference while making few practical sacrifices compared to previous hypercubic inference approaches. We show that HyperHMM allows any combination of traits to exert arbitrary positive or negative influence on the acquisition of other traits, relaxing a common limitation of only independent trait influences. We apply this approach to synthetic and biological datasets and discuss its more general application in learning evolutionary and progressive pathways. Availability and implementation Code for inference and visualization, and data for example cases, is freely available at https://github.com/StochasticBiology/hypercube-hmm. Supplementary information Supplementary data are available at Bioinformatics online.

N j=1 a i,j = 1, since we have to go to one of the states, and hence the probability of going from state i to any other state has to be 1.
The transition probability matrix, or just the transition matrix, is the matrix representation of all the transition probabilities: a 1,1 a 1,2 · · · a 1.N a 2,1 a 2,2 · · · a 2,N . . . . . . . . . . . . a N,1 a N,2 · · · a N,N      Definition .3 (Emission probabilities and emission matrix [2]). Let {X t , t ∈ T } be a stochastic process with state space S, and O be the sequence of observations drawn from the set {y 1 , ..., y m }. Then b i (y k ) = P (y k |X n = s i ) denotes the emission probabilities (or observation likelihoods). This is the probability of generating observation y k given that we are in state s i . Note that m k=1 b i (y k ) = 1.
The emission matrix is the matrix representation of all the emission probabilities:

.2 The Baum-Welch algorithm
A hidden Markov model (HMM) is a model wherein a system evolves according to an unobservable Markov chain, but emits signals that are observable. A given HMM is described by the A and B matrices above. The Baum-Welch algorithm, developed in the late 1960's and early 1970's, is a variant of the expectationmaximisation algorithm that estimates transition-and emission probabilities in a hidden Markov model, with very diverse use across fields including speech recognition, computational biology, computer vision and econometric [3].
To use the Baum-Welch algorithm one need a finite number of states and a sequence, or multiple sequences, of observations. The goal of the Baum-Welch algorithm is to use this information to estimate the transition probabilities in the underlying Markov chain, and the emission probabilities connected to the Markov chain. One important feature of the Baum-Welch algorithm is that it is only guaranteed to find a local optimum, not necessarily the global optimum, as is the case with many learning algorithms [3]. This means that we are not guaranteed the best solution every time, and that the result might depend on the initialisation of the algorithm.
We proceed by reviewing the original Baum-Welch algorithm, before describing the adaptations to the multiple signals, hypercubic case. At the core of the algorithm are several sub-functions, which are called α-, β-, ξ-, and γ-function, also referred to as α-, β-, ξ-, and γ-probabilities [4]. We write λ = A, B for the set of parameters describing a particular HMM, and q t for the state of the (unobservable) HMM at time t.
Definition .4 (α-probability). The α-probability is the probability of seeing a given set of observations up to time t and that we are in state i at time t given all of the hidden Markov model parameters λ: Definition .5 (β-probability). Given a state i at time t and the model parameters λ, the β-probability is the probability of observing all the observations starting from o t+1 , the observation at time t + 1, and going to o T , the observation at T : Definition .6 (ξ-probability). Given an observation sequence O = o 0 , ..., o T and the model parameters, the ξ-probability is the probability of being in state i at time t and state j at time t + 1: Definition .7 (γ-probability). Given an observation sequence O = o 0 , ..., o T and the model parameters, the γ-probability is the probability of being in state i at time t The ξ-function and γ-function is built up using the α-function and β-function, while the updating of transition and emission probabilities are done by using the ξ-function and γ-function.
The main idea of the Baum-Welch algorithm is to calculate the probability of seeing the observations when moving forward in time and when moving backward in time independently using the initial estimate of A and B. Then these calculations can be used to find a value for being at any given state at a given time. We can then normalize these values to find the new estimate of A and B. Iterating this process repeatedly until convergence is the Baum-Welch algorithm, given in Algorithm 1, with each step explained in more detail in Section .4.

.3 Multiple Sequence Baum-Welch Algorithm
The original Baum-Welch algorithm takes one observation sequence, O = o 0 , ..., o T . In hypercubic inference we typically have a dataset involving multiple independent evolutionary or progressive instances of a system, and are therefore interested in estimating parameters that best describe the dynamics of this ensemble of instances. We therefore require a multiple-sequence adaptation of the Baum-Welch algorithm.
Algorithm 1: The Baum-Welch algorithm [2]. Takes a set of T + 1 observations o 0 , ..., o T , estimates transition matrix A and emission matrix B.
Select a first estimation (could be randomly)Â andB.
In the single-sequence Baum-Welch we update the transition probabilities using the following formula: estimating the probability, summed over all times, of being in j the timestep after being in i, normalised by the total probability of being in any state k the timestep after being in i.
Let us use the following notation, O r = o r,0 , ..., o r,T , to denote the r'th observation sequence, and ξ r,t is the ξ-probability calculated using O r at time t. Note that all O r is independent of each other. Then we can update the transition probabilities like this instead: Which is the same as before, except that we now sum over all the independent observation sequences as well.

.4 Hypercubic Multiple Sequence Baum-Welch Algorithm
Our main objective is to infer transition probabilities over a directed hypercube given some data. The weight of the edges on the hypercube is the transition probabilities from node i to node j. We assume that the Markov condition holds -that is, the probability of going from one state to another is only dependent on the previous state. This allows the presence and absence of different traits to have arbitrary influence on the acquisition of further traits, but means that the history through which the existing traits were acquired does not influence future behaviour. This is intuitively appropriate for many biological situations where, for example, the fitness of an organism is a function of the organism itself and not its evolutionary history.
The state space in this case is the set of all nodes on the hypercube, meaning if we have L traits there are 2 L possible states to be in. With a complete transition graph this would imply an extremely large transition matrix for high L, with 2 L · 2 L values. However, since we are assuming a hypercubic structure -that is, that no entity can remove a trait once they have it, and that we can only get one trait at a time -this transition matrix is sparser. The number of allowed transitions are the number of edges on the hypercube, 2 L−1 L, which means we will have 2 2L − 2 L−1 L zero entries in the matrix.
The observation data we are considering will typically involve transitions between two states. This will either be the transition from the 0 L initial state to a given observation (in the case of cross-sectional data) or the transition from some precursor state to some subsequent state (in the case of longitudinal or phylogenetic data).
We begin by picturing each independent observed transition as a subset of a longer trajectory from the node of all zeroes to the node of all ones. As steps through our hypercubic state space involve the acquisition of one trait at a time, each such long trajectory will involve L steps. We use the wildcard character '?' to represent all unspecified states in a given trajectory. For example, a cross-sectional observation of 1001 would give the trajectory 0000 →? → 1001 →? → 1111, and a longitudinal observation 1100 → 1101 would give 0000 →? → 1100 → 1101 → 1111.
Writing the observation set in this way gives us some advantages. First, notice that o 0 always equals all 0's and o T always equals all 1's, and that every observation sequence, once we have added the unknown states, are of the same length. The question mark will from now on represent all possible states to be in at that time given the observation sequence.
Let us start by looking at how we can calculate the α-probabilities. In the normal Baum-Welch algorithm the α-probabilities is defined as follow, and 0 for all other states.
This gives the following recursive formula for calculating α t (i) when t > 0: In an unrestricted case, calculating and summing this value for every state would become intractable for large L. The hypercubic structure of the problem dramatically simplifies the calculation. Let us first look at the sum over all states j. Here, a j,i = 0 for every state j where it is not possible to move from j to i. This means that we only need to sum over all states where it is possible to move from j to i. Let us define this set as follows: If we know the state at time t, we only need to calculate α t (i) for that given state and set all other α t values to 0. However, if we do not know the state at time t, we need to calculate α t (i) for every i accessible from j where α t−1 (j) > 0. This set of accessible i states will be defined as F t as follows: Definition .9. i ∈ F t if and only if there exists a state j such that a j,i > 0 and α t−1 (j) > 0. This gives us the following expression for α t (i).
This calculation only applies to the case where o t is unknown. If o t , is known we would know that every α t (i) = 0 as long as i ̸ = o t . The calculation for α t (o t ) is however the same as the upper equation in Def. .10.
The β-probabilities are calculated in a similar way. We start by looking at the most simple case, when t = T . Since we do not have any observations after time T , we will define β T (i) to be 1 for every i ∈ S.
For t < T we have the following expression in the normal Baum-Welch algorithm: . This equation can be rewritten such that we gain a recursive formula, like in Def. . 10.
Like with the α-probabilities we do not need to calculate everything all the time, but the general rule for calculating the β-probabilities is: Definition .11. Let B t denote the set of all possible states to be in at time t. The set B t hence contains all binary strings which has t 1's.
For the ξ-probabilities we have to do something slightly different. The most simple case is when both o t and o t+1 are known. When this is the case, we have full information and know that ξ t (i, j) = 1 when i = o t and j = o t+1 . For every other value of i and j, ξ t (i, j) = 0.
In general, ξ t (i, j) can be calculated using the α-and β-probabilities. We know the probability of seeing everything up to time t, and being in state i at time t (α-probabilities), we know the probability of going from i to j (a i,j ), and we know the probability of seeing everything after time t given state j (β-probabilities). Then ξ t (i, j) will just be all these probabilities multiplied together.
As with the α-and β-probabilities we can use the known observation sequence to cut down the number of needed calculations. When o t is known and o t+1 is unknown, we only need to calculate for every j which is possible to reach from i. If the opposite is true, that we know o t+1 but not o t , we only need to calculate for every i where it is possible to go to j.
The last possible case is when none of the states are known. In this case we need to calculate ξ t (i, j) for every state i which is possible to be in at time t and every state it is possible to go to from that state.
Definition .13 (ξ-probabilities). For the ξ-probabilities we will have these four cases: If only o t is known:

0, otherwise
If only o t+1 is known: If both o t and o t+1 are unknown: Now to the updating of the transition probabilities. We intialise the algorithm with a uniform transition matrix, meaning each value in the matrix is a i,j = 1 # of states to go to from i , if it is possible to go from i to j, and 0 otherwise. We then need to calculate the ξ-probabilities for all of the given observation sequences. When we have the ξ-probabilities we can update our transition matrix. This is done the following way: where ξ t,r (i, j) is ξ t (i, j) for observation sequence r.
To illustrate the subsets of space involved in a calculation, Figure S2 shows all the nodes which will be given a value > 0 for (a) the α-probabilities, (b) the β-probabilities, and (c) the ξ-probabilities for the observation 1010.

A Hypercubic Baum-Welch example calculation
To illustrate the algorithm, we now consider an explicit example. Assume we are given 10100 as an observation in a cross-sectional context. That is interpreted as the sequence 00000 → ? → 10100 → ? → ? → 11111. We begin with a uniform transition matrix, and proceed to calculate the α-, β-, and ξ-probabilities. Calculations are described one timepoint at a time in the text, and illustrated in Fig. S3.

A.1 α calculations
We first need to calculate the forward probabilities, defined as follows: For t = 0 we have α 0 (00000) = 1 and α 0 (i) = 0, for all i except 00000. This will always be the case independent of the observation.
Here again we will only have a few possible non-zero values. Since α 3 (j) ̸ = 0 only for j ∈ {11100, 10110, 10101} only the places we can go from these three will have non-zero values for α 4 . Hence, if i ∈ {11110, 11101, 10111} then α 4 (i) = 1 30 · 1 2 + 1 30 · 1 2 = 1 30 . This is happening since for all j ∈ {11100, 10110, 10101} there is only two possible ways to go.  For t = 3 we have β 3 (i) = P (o 4 , o 5 = 11111|o 3 = i) = j P (o 5 = 11111|o 4 = j) · a i,j j β 4 (j) · a i,j . From t = 4 we have that β 4 (j) = 1 for j ∈ {11110, 11101, 11011, 10111, 01111} and 0 otherwise. Hence we only need to calculate the values for every state where it is possible to reach one of these five values. Since these five values are all the possible states to be in at time t = 4 and the transition matrix is uniform a i,j = 1 2 > 0 for every state that is possible to be in at time t = 3. We then end up with β 3 (i) = 1 2 + 1 2 = 1 for all i which is possible to be in at time t = 3.

A.3 ξ calculations
For t = 0 we have ξ 0 (00000, j) = P (o 0 = 00000, o 1 = j). This will only be non-zero for values where it is possible to go to from 00000. We also see that from the β-calculations that only the states 10000 and 00100 have non-zero values for β 1 so we only need to look at these two states. Again, since our transition matrix is uniform so every calculation will be similar. So, if j ∈ {10000, 00100} we have ξ 0 (00000, j) = α 0 (00000)a 00000,j β 1 (j) = Figure S3: Example edges considered in the HBW calculation. Given the observation 10100, (A) shows all the nodes that will have α t (i) > 0 for any time t; (C) shows all the nodes that will have β t (i) > 0 for any time t; (C) shows all the nodes that are included in both (A) and (B). All the edges connecting two nodes here will be given some weight when uploading the transition probabilities. For t = 1 we have ξ 1 (i, 10100) = P (o 1 = i, o 2 = 10100) = α 1 (i)a i,10100 β 2 (10100). From the α-calculations we know that α 1 (i) = 1 5 for i ∈ {10000, 01000, 00100, 00010, 00001}, but here we also see that only 10000 and 00100 will give a non-zero value for a i,10100 . Hence, we end up with ξ 1 (10000, 10100) = 1 5 · 1 4 · 1 = 1 20 and ξ 1 (00100, 10100) = 1 5 · 1 4 · 1 = 1 20 .

B Prior information and model selection
Although we cannot encode arbitrary prior information without a Bayesian picture, we can make some progress by encoding prior information in the structure of the hypercubic network that underlies our model. Say we know that trait a always appears before trait b. A simple way of including this prior information is by removing all edges in the hypercubic transition network that correspond to an acquisition of b from a state without trait a. The possibility of such a transition is thus removed from the inference process, and the HBW algorithm will identify the maximum likelihood parameterisation of the remaining edges given the observed data. This approach of removing edges also allows a degree of model selection to be applied. We can consider different models for a given system, involving hypercubic networks with different edge sets removed. The HBW algorithm will identify a maximum likelihood parameterisation for each (or, if a chosen model is incompatible        with observations, fail to identify any parameterisation and give a zero likelihood). The number of free parameters associated with a transition graph G is the number of outgoing edges d out (s) from each non-terminal node s minus one, summed over all nodes: k(G) = s∈G,dout(s)>0 (d out (s) − 1). k(G) for the full hypercube can readily be computed as L−1 n=0 L n (L − n − 1) = 1 − 2 L + 2 L−1 L. Given this parameter count and the associated likelihood, we can then use model selection approaches like the Akaike and Bayesian Information Criteria to choose parsimonious hypercube structures that are compatible with observations. As a simple example, consider the single and double pathway models above. We can consider three different model structures: (a) the full hypercube (as in all previous sections); (b) the hypercube with all edges removed except those on the single pathway 000... → 100... → 110... → ...; (c) the hypercube with all edges removed except those on the two pathways 000... → ...001 → ...011 → .... Then consider finding the maximum likelihood parameterisations of each of these models with (i) N samples from the single pathway system and (ii) N samples from the double pathway system. The results, along with associated AIC values, are given in Table. S2, demonstrating that models that are unnecessarily complex (a) or insufficiently complex (b)(ii) are penalised in favour of those that better match the observational structure (b)(i), (c)(ii).
The principle of edge removal can also be used to regularise identified models, as demonstrated in Ref. [5]. Here, the full hypercube model is first parameterised, then those edges with the lowest weights are removed and a model selection criterion recalculated. The process continues to iteratively remove low-probability edges until an unacceptable likelihood penalty results. While not flawless (for example, the best solution might involve removing a combination of edges that is not encountered in the iterative one-at-a-time approach), this approach can be used to lower the complexity of learned models given a set of observations.